A systematic review on the direct approach to elicit the demand-side cost-effectiveness threshold: Implications for low- and middle-income countries

Several literature review studies have been conducted on cost-effectiveness threshold values. However, only a few are systematic literature reviews, and most did not investigate the different methods; in-depth reviews of directly eliciting WTP per QALY are especially lacking. Our study aimed to 1) describe the different direct approach methods to elicit WTP/QALY; 2) investigate factors that contribute the most to the level of WTP/QALY value; and 3) investigate the relation between the value of WTP/QALY and GDP per capita and give some recommendations on feasible methods for eliciting WTP/QALY in low- and middle-income countries (LMICs). A systematic review of studies estimating WTP/QALY with a direct approach was carried out in seven databases, with a cutoff date of 03/2022. Monetary values were converted into 2021 international dollars (i$) via CPI and PPP indexes. The influential factors were evaluated with Bayesian model averaging. Criteria for recommending feasible methods in LMICs are based on empirical evidence from the systematic review and on the resource limitations in LMICs. A total of 12,196 records were identified; 64 articles were included for full-text review. The WTP/QALY methods and values varied widely across countries, with a median WTP/QALY value of i$16,647.6 and a median WTP/QALY per GDP per capita of 0.53. A total of 11 factors were most influential, among which the discrete-choice experiment method had a posterior probability of 100%. Methods for deriving WTP/QALY vary largely across studies. Eleven influential factors contribute most to the level of WTP/QALY values, among which the discrete-choice experiment method had the greatest effect. We also found that in most countries, values for WTP/QALY were below 1 x GDP per capita. Some important principles are addressed related to what LMICs may be concerned with when conducting studies to estimate WTP/QALY.


Introduction
Due to increasing health expenditures and scarcity of resources, health care policymakers face the challenge of how to allocate health care resources efficiently. Cost-utility analyses have gained popularity in health technology assessments, as they apply quality-adjusted life years (QALYs) as health outcomes, which enables comparisons across different diseases and treatment programs [1]. A relevant question is then how to assign a monetary value to each QALY [2], i.e., how much money are governments willing to spend on additional QALYs? Following this line of thought, based on results from a cost-utility analysis, a health technology below a certain national threshold value (cost per QALY) will be considered cost-effective and thus reimbursed [3,4]. Such information is helpful for better consistency and transparency in reimbursement decisions in health care. As low- and middle-income countries (LMICs) face even greater resource scarcity, it becomes even more important for LMICs to have an appropriate threshold value for reimbursement decisions within health care [5].
Threshold values have been established in Europe [3,, the US [27][28][29][30][31][32][33][34][35][36], and a few Asian countries, such as Iran [37][38][39][40][41], Thailand [42][43][44], Japan [45][46][47], China [48,49], and Malaysia [50], but only two studies were conducted in LMICs, including Thailand in 2008 [42] and Vietnam in 2018 [51]. Although the World Health Organization (WHO) no longer recommends a threshold value of 1-3 times the gross domestic product (GDP) per capita per DALY averted [52][53][54], this value is still often applied in countries that lack their own threshold values, especially in LMICs [55]. Furthermore, both DALYs and QALYs translate the impact of non-fatal health effects into a life year measure, so that the years of life lived in different health states or lost to premature fatality can be combined into a single indicator [53]. Therefore, in practice, most countries use this value for QALY as well. However, it is quite often argued that the WHO recommendation might lack empirical evidence and might lead to inappropriate decisions regarding treatment adoption and resource allocation in health care services [53,54,56]: as the WTP/QALY seldom exceeds 1 x GDP per capita, applying a threshold of 2-3 x GDP per capita per QALY might exhaust the national health budget.
The threshold value varies largely across countries, as health systems and affordability differ [54,57], and methods for eliciting threshold values also vary considerably [42,44,58]; however, thus far, there has been no agreement on which method can be considered the standard method [59]. There are two well-known conceptual perspectives used to derive such threshold values: the supply-side opportunity cost perspective and the demand-side willingness to pay (WTP) perspective [54,56]. The former perspective focuses on identifying the opportunity cost resulting from the disinvestment required to adopt a new technology [2,54], while the latter refers to the willingness to pay for a small health gain and then aggregating the WTP needed for a QALY [2,54]. The supply-side perspective also requires comprehensive and comparable information on the cost per QALY of all interventions and thus is less used in practice relative to the demand-side WTP [41].
For the demand-side WTP, two general approaches are used: 1) directly eliciting individuals' WTP by using surveys and 2) indirectly inferring a value of health gain by estimating WTP for reductions in mortality or willingness to accept a risk, which is also known as the value of statistical life (VSL) method [2,59,60]. To date, most studies have applied the first approach [2,59].
The process of directly eliciting WTP per QALY generally involves three steps (Fig 1): 1) estimating health gain in terms of health preference, 2) eliciting the WTP for that health gain, and 3) combining the estimates from steps 1 and 2 to estimate WTP for a QALY [2]. In terms of estimating health gain in step 1, one can elicit health preference by either using a health preference measure (direct method) or via multi-attribute utility measures (indirect method) [61,62].
Several literature review studies have been conducted to evaluate the implementation of different methods [1,2,49,55,57,59,[63][64][65][66]. However, only a few are systematic literature reviews [49,57,59,64], whereas the rest are overviews or narrative reviews [1,2,55,60,63,65]. Most of these reviews did not investigate the different methods for eliciting threshold values; in particular, in-depth reviews of the direct elicitation of WTP per QALY are lacking [1,55,56,63,64,66]. Two systematic reviews explored how different methods might impact the threshold value [1,63,65]; however, no study applied a regression technique to incorporate all the relevant methodological characteristics simultaneously, and little is known regarding which methodological characteristics are most influential.
The aim of this study is to 1) describe the different methods that have been used for eliciting WTP/QALY with the direct approach; 2) investigate which factors contribute most to the level of WTP/QALY values; and 3) investigate the relation between the value of WTP/QALY and GDP per capita and give some recommendations regarding which methods might be more feasible for eliciting WTP/QALY in LMICs.

Study design
This systematic review was carried out following PRISMA guidelines [67] to document the knowledge gap regarding how WTP per QALY was elicited, identifying all influential factors.

Data sources and search strategy
A systematic search with a publication restriction from January 2000 to March 2022 was conducted in seven databases, including PubMed, Embase, Psycinfo, Centre for Reviews and Dissemination (CRD), Cumulative Index to Nursing and Allied Health Literature (CINAHL), EconLit, and International HTA.
Search terms were constructed based on PICOS domains (Population, Intervention, Comparison, Outcomes, and Study design) [68] with O for WTP in combination with QALY. The detailed search strategies are shown in S2 Text. In addition, we also reviewed all references of the included studies in case some eligible studies had not been identified through the search.

Inclusion and exclusion criteria
Original studies conducted in any country were included if they elicited WTP per QALY in health-related issues by a direct approach. Studies were excluded if they were (i) not available as a full-text paper (available only as an abstract or poster); (ii) not written in English; (iii) just a literature review; or (iv) applying an indirect approach that used VSL.
Critical appraisal of studies: Quality assurance process. Two investigators independently performed abstract screening, full-text reviews, information extraction and quality assessment. Disagreements were resolved by consensus in discussion with the rest of the research team.
Quality assurance was implemented in four steps: (i) All records identified through database searching were imported into the reference library software Zotero 5.0.92, and duplicates of these records were excluded by either a merging tool or Zotero. (ii) After removing duplicates, the titles and abstracts of these articles were screened. (iii) The full-text articles were assessed for eligibility against the selection criteria. (iv) The quality of articles was appraised by using the Appraisal tool for Cross-Sectional Studies (AXIS tool) with 20 components, developed by Downes et al. [69] in 2016 (S1 Table). Each question in the AXIS tool was answered as "yes", "no", "unclear", or "not applicable."

Information extraction and data preparation
Information from the full text was extracted using a standard extraction form approved by the research group. The details of the extracted information are presented in S3 Text. Moreover, data on gross domestic product (GDP) per capita for each study were retrieved from the World Bank [70] based on the reporting year (or year of publication if the reporting year was unavailable) and country of study.

Data analysis
Descriptive analyses were used to describe the extracted data. Continuous variables are expressed as the mean (standard deviation (SD)) and median (interquartile range (IQR)), and categorical variables are expressed as frequencies and percentages. To compare the threshold value across different countries and time periods, the ratio of WTP per QALY to GDP per capita was extracted, or estimated if the study did not report it. The different currencies were first converted to US dollars using the exchange rate in the reporting year, and then converted to 2021 international dollars (i$) by using the country's consumer price index (CPI) [71] and purchasing power parity (PPP) [71,72]. The Kruskal-Wallis test was applied to test the WTP per QALY differences between category groups.
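The price-year and currency adjustment described above can be illustrated with a minimal Python sketch. It assumes one common simplification: the local-currency value is first inflated to 2021 prices with the country's CPI and then divided by the 2021 PPP conversion factor. The function names and all numbers are hypothetical, and the exact sequence used in this review (which passes through US dollars) may differ.

```python
def to_international_dollars_2021(value_lcu, cpi_report_year, cpi_2021, ppp_2021):
    """Convert a local-currency value to 2021 international dollars (i$).

    Assumed inputs (illustrative names, not taken from the review):
      value_lcu       -- WTP per QALY in local currency units, reporting-year prices
      cpi_report_year -- the country's CPI in the reporting year
      cpi_2021        -- the country's CPI in 2021 (same base year as above)
      ppp_2021        -- 2021 PPP conversion factor (local currency units per i$)
    """
    if min(cpi_report_year, cpi_2021, ppp_2021) <= 0:
        raise ValueError("CPI and PPP inputs must be positive")
    # Step 1: express the value in 2021 prices using the CPI ratio.
    value_lcu_2021 = value_lcu * cpi_2021 / cpi_report_year
    # Step 2: convert 2021 local currency units into international dollars.
    return value_lcu_2021 / ppp_2021


def ratio_to_gdp_per_capita(wtp_per_qaly, gdp_per_capita):
    """WTP per QALY as a multiple of GDP per capita (both in the same unit and year)."""
    return wtp_per_qaly / gdp_per_capita


# Hypothetical example values, chosen only to make the arithmetic easy to follow.
wtp_i_dollars = to_international_dollars_2021(
    value_lcu=100_000_000, cpi_report_year=80, cpi_2021=100, ppp_2021=8_000
)
ratio = ratio_to_gdp_per_capita(wtp_i_dollars, gdp_per_capita=31_250)
print(wtp_i_dollars, ratio)
```

Expressing the result as a ratio to GDP per capita is what allows values from different countries and years to be compared on one scale.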
To evaluate which factors could influence WTP per QALY, the Bayesian Model Averaging (BMA) method was applied to select candidate covariates. The BMA approach addresses the uncertainty in the variable selection process by selecting a number of models from all possible models and performing all inferences and predictions via the posterior probabilities of these models [65,73]. The model with the lowest Bayesian information criterion (BIC) and the highest posterior probability was selected as the best model [74]. The factors assessed included year of publication, reporting year, continent, number of scenarios, options of scenarios, subjects, mode of administration, number of willingness to pay-eliciting methods (WEM), number of utility elicitation methods (UEM), kind of WEM and kind of UEM.
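To make the idea behind BMA concrete, the following Python sketch enumerates every covariate subset of a small linear regression, weights each model by exp(-BIC/2), and sums the weights into posterior inclusion probabilities. This is a generic BIC-approximation illustration on synthetic data, not the R analysis performed in this review; the function name `bma_bic` and the variable names are hypothetical.

```python
import itertools

import numpy as np


def bma_bic(X, y, names):
    """BIC-approximated Bayesian model averaging over all covariate subsets."""
    n, p = X.shape
    models = []
    for k in range(p + 1):
        for subset in itertools.combinations(range(p), k):
            # Design matrix: intercept plus the covariates in this subset.
            design = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
            beta, *_ = np.linalg.lstsq(design, y, rcond=None)
            rss = float(np.sum((y - design @ beta) ** 2))
            # Gaussian-likelihood BIC (up to an additive constant).
            bic = n * np.log(rss / n) + design.shape[1] * np.log(n)
            models.append((subset, bic))

    bics = np.array([b for _, b in models])
    # Posterior model probabilities under equal prior model weights.
    weights = np.exp(-0.5 * (bics - bics.min()))
    weights /= weights.sum()

    # Posterior inclusion probability: total weight of models containing a covariate.
    inclusion = {
        name: float(sum(w for (s, _), w in zip(models, weights) if j in s))
        for j, name in enumerate(names)
    }
    best = int(np.argmax(weights))
    best_names = [names[j] for j in models[best][0]]
    return inclusion, best_names, float(weights[best])


# Synthetic data: only the first covariate truly affects the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)
inclusion, best_names, best_prob = bma_bic(X, y, ["x0", "x1", "x2"])
print(inclusion, best_names, best_prob)
```

Full enumeration is feasible only for a handful of covariates; dedicated BMA software typically restricts the search to a subset of well-supported models.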
All statistical analyses were performed in R version 4.0.0, and a p value < 0.05 was considered statistically significant.

Criteria for recommendation for feasible methods in LMICs
The recommendations are made based on 1) empirical evidence from the systematic review, which method might be most scientifically approved and applied; and 2) given the resource limitation in LMICs, which methods are most feasible in terms of data availability within the budget constraints.

Study selection
The study selection process is presented as a PRISMA flow diagram in Fig 2. The search terms in the seven databases yielded a total of 12,196 records, and 3,471 records were removed due to duplication, leaving 8,725 records for title and abstract screening. Based on the inclusion/exclusion criteria, 8,530 records were excluded. In total, 195 articles were reviewed as full-text, among which 131 were excluded for the following reasons: duplicated (n = 3), not eliciting WTP/QALY value (n = 71), literature reviews (n = 21), not available in full text (n = 22), not in English (n = 5), and indirect approach (n = 9). Overall, 64 articles were used for data extraction.
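The counts in the selection flow can be cross-checked with a few lines of Python. The numbers are those reported above; the variable names are illustrative only.

```python
# Counts as reported in the study selection paragraph.
identified = 12_196
duplicates_removed = 3_471
excluded_at_screening = 8_530
full_text_exclusions = {
    "duplicated": 3,
    "not eliciting WTP/QALY value": 71,
    "literature reviews": 21,
    "not available in full text": 22,
    "not in English": 5,
    "indirect approach": 9,
}

screened = identified - duplicates_removed          # title and abstract screening
full_text = screened - excluded_at_screening        # full-text review
excluded_full_text = sum(full_text_exclusions.values())
included = full_text - excluded_full_text           # used for data extraction

print(screened, full_text, excluded_full_text, included)
```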

Study characteristics
General characteristics of the studies. The study characteristics are reported in Table 1. The results from the review suggested that most articles (82.8%) were published after 2010; the number of studies published in the five years from 2015 to 2020 was equal to the total number of those published before 2015. Studies were mostly from Europe (48.4%) and Asia (35.9%). More than 70% of the studies were from high-income countries, nearly 30% were from middle-income countries (upper middle-income countries, 27%; lower middle-income countries, 3%), and no study was found in low-income countries. The majority restricted the scope to a single country (95.3%), and only three studies (4.7%) were conducted in multiple countries. Most studies had a first author affiliated with a university (71.9%), reported funding sources (70.3%), and declared no conflict of interest (67.2%).
Characteristics of the research design. The characteristics of the methods related to eliciting WTP per QALY are reported in Table 2. Individual perspectives were mostly used (78.1%), followed by societal perspectives, either exclusive or inclusive of the individual perspective (9.4%). Most studies collected data from the general population (70.3%), using face-to-face interviews (40.6%) or web-based surveys (39.1%). The sample size was mostly over 1000 participants, followed by 100-500 people.
Regarding scenarios, most studies selected 2 to 5 scenarios (31.3%). The ex ante context of the hypothetical scenario, which asked how much participants not yet suffering from an illness would pay to lower their risks, was used more often than the ex post context, which asked respondents already suffering from an illness to pay for specific treatment (42.2% versus 37.5%). The type of hypothetical scenario labelled as unspecified disease/illness was the most used (62.5%). The common type of QALY gain was improving quality of life (60.9%) with unfixed/closed value gain in which respondents did not know the size of the gain (66.1%). The most common duration of the hypothetical scenario was the period from 1 month to 1 year (39.1%). In addition, 34.4% of the studies used lump-sum payments. Nearly 80% of the studies used regression analysis to analyze the factors influencing the WTP per QALY value.
Characteristics of methods to elicit health preference. Details regarding the methods used for eliciting health preferences are reported in Table 3. Methods for eliciting preference vary largely across studies, among which the directly elicited health preference methods were mostly applied (43.8%), followed by the indirectly elicited health preference methods, which are known as the preference-based quality of life measures (PBM) (34.4%). Among the direct methods (standard gamble (SG), time trade-off (TTO), and the visual analog scale (VAS)), the majority applied mixed methods (10 out of 27, 37.0%) and VAS (9 out of 27, 33.3%); and among the PBM, the majority applied the EQ-5D instrument (19 out of 22 studies, 86.4%). It is difficult to tell whether 3L or 5L was more popular, as 7 studies did not report which EQ-5D version was applied. Among those that applied the EQ-5D instrument, most studies (13 out of 19) applied both the EQ-5D index and EQ VAS; however, only a few studies (7 out of 19) presented both values. Among the 6 studies that also used mixed methods, the combinations were heterogeneous: no two studies used the same mix.

Characteristics of the willingness to pay-eliciting method
Details regarding how the WTP questions are addressed are reported in Table 4. The majority of studies applied the contingent valuation method (89.1%), among which most mixed more than two approaches (28 out of 57 studies), usually either a bidding game or an open-ended question with other approaches. For studies that applied only one approach, the bidding game (n = 7) and double-bound dichotomous choice question (n = 7) were mostly used.
Characteristics of the WTP/QALY combination method. Table 5 shows the characteristics of the WTP/QALY combination method. Approximately one-third of the studies used the aggregated method to combine WTP per QALY. Moreover, 28.1% applied the disaggregated method, 7.8% combined both the aggregated and disaggregated methods, and approximately 10.9% applied the regression method. Approximately 15.6% of the studies did not state which method they applied as a combination method.

Results of WTP per QALY
The results for WTP per QALY by study, country and year are reported in S2 Table. For an overview and easy comparison, WTP per QALY by country after conversion into international

Influential factors to WTP/QALY
Subgroup analysis. The detailed results of the subgroup analysis are reported in S4 Text. In general, there were differences in WTP per QALY values between subgroups, and these differences were statistically significant.
Multivariate regression analysis. The results from the multivariate analyses are reported in Table 6. Among the 169 evaluated models from BMA, Model 1 is the best model (BIC = -21.43, posterior probability = 0.046). The influential factors are type of country income (lower middle-income), type of QALY gain (a combination of improving quality of life, extending life, saving life, others or not applicable), context of hypothetical scenario (both ex post and ex ante), duration of hypothetical scenario (>1 year), sample size (501-1000, >1000, not reported), mode of administration (other combination), type of willingness to pay (discrete), specific willingness to pay eliciting methodology (DBDC), payment vehicle (none, not clearly stated) and utility elicitation method (EQ-5D and TTO; EQ-5D-3L index; both the EQ-5D-3L index value and EQ VAS score are used, either the EQ-5D-5L index value or EQ VAS score was used but not specified in the study; SF-6D; combination of VAS/SG/TTO). The discrete-choice experiment method had a posterior probability, i.e., the probability that the variable affects the mean willingness to pay per QALY, of 100%. Factors with a negative coefficient had an inverse effect on the WTP/QALY value; conversely, factors with a positive coefficient had an effect in the same direction. This model explained 34.1% (R² = 0.341) of the variance in the mean willingness to pay per QALY.
Quality assessment. Results of the quality appraisal of studies using the AXIS tool are presented in S4 Table. All studies clearly defined the objective and target population and had appropriate study designs. Most studies appropriately measured the value of WTP per QALY (96.9%) by using instruments that had been piloted or published previously (92.2%). Most of them sufficiently described their methods (98.4%) and statistical significance (89.1%). However, very few studies justified the sample size (9.4%) or addressed non-responders (6.3%). Regarding the results, most studies adequately described the data on WTP and health preference (98.4%), results of analyses (95.3%), and limitations (81.3%). More than half of the studies (65.6%) reported that the study results were not affected by funding sources or conflicts of interest.

Discussion
We found that the methods for deriving WTP/QALY vary largely across studies, which is consistent with previous findings [59,64].The societal perspective, perspective of healthcare provider, type of QALY gain of extending life, and the context of hypothetical scenarios concerning both ex post and ex ante contribute the most to the level of values of WTP/QALY.We also found that in most countries, values for WTP/QALY were below 1 x GDP per capita.
In the following sections, we address some important principles related to what LMICs may be concerned about when conducting studies to estimate WTP/QALY. To begin, relative to the supply-side approach, LMICs may contemplate the adoption of a demand-side direct approach (WTP/QALY) as a means to establish a national threshold value. Several justifications underlie this choice. First, in the past decade, several high-income countries (HICs), such as England [63], Spain [76], Sweden [77], the Netherlands [78], and Australia [79], have focused on the supply-side approach, which may be more relevant to inform decision making on resource allocation [63,80]. However, this approach necessitates the availability of substantial and comparable datasets within the health sector, encompassing data for healthcare expenditure and health outcomes, alongside variables to control for healthcare necessity [80,81]. Regrettably, such comprehensive data are often scarce within LMICs, rendering the demand-side approaches more operationally viable [63]. Second, the demand-side approach assumes that the health budget is not fixed but fluctuates in response to changing healthcare requirements [54]. This assumption aligns more closely with real-world dynamics, as the health care budget can be compensated by the state budget when it faces deficits. Among the demand-side WTP methods, the indirect approach using VSL also requires sufficient data on employment and workplace fatalities, which may also not be available in LMICs. Moreover, the VSL method involves scenarios with a very small reduction in mortality, which can yield higher thresholds relative to the WTP/QALY direct approach [59,82]. Therefore, it might be more feasible for LMICs to establish the national threshold by using the direct approach for the demand-side method (estimating WTP/QALY). However, it is crucial to carefully consider methodological rigor, generalizability, and ethical implications in order to ensure the validity and applicability of the results. It requires collaborative efforts that involve policymakers, researchers, and stakeholders to establish robust and widely accepted cost-effectiveness thresholds using the direct approach in LMICs.

Perspective
Most studies (78.1%) applied the individual perspective, in which respondents make the choice that maximizes their own benefit, in line with the principles of welfarism. However, we, like Bobinac et al. (2013), judge the theoretical reasons for a societal perspective more convincing. The social value of a QALY is defined as the amount of consumption that individuals are willing to forego to contribute to a health gain achieved in society. This gain may, or more frequently may not, accrue to the payer. We thus think that social value is the most reasonable construct in a society with collectively funded health care. Citizens pay regularly regardless of whether they need health care at any particular point in time, and the incremental cost at the point of consumption of health care is relatively small.

Population
Most studies (70.3%) used the general population as the study sample, as it is a heterogeneous population and the results can therefore be generalized [83]. A smaller number of studies used patients, clinicians or politicians as respondents, which limits the findings to certain health conditions [42]; hence, the results might not be generalizable to other population groups. The recommendation to use the general population as the study sample is therefore mainly based on the argument of generalizability, as the threshold value informs reimbursement decisions at the national level, which affect everyone in the country. Second, it is easier to enrol enough respondents from the general population than from specific groups, e.g., patient groups. Thus, this sample type requires less effort to recruit, which might be favored by LMICs. Accordingly, for generalizability and feasibility, we would recommend using the general population as a study sample.

Sample size
The sample size varies across studies, from below 100 to above 1000 respondents. However, only 9.4% of the studies (n = 6) [42,47,51,[84][85][86] gave a rationale for their sample size, and only three of them [42,51,85] presented the formula for their sample size calculation. As a rule of thumb, some researchers have recommended that sample sizes larger than 30 and less than 500 are appropriate for most research [87][88][89]. However, it has also been recommended that a good maximum sample size is approximately 10% of the population, as long as this does not exceed 1,000 [90,91]. Further research is needed to investigate this issue. It is difficult for us to recommend any specific sample size; it all depends on the study setting. An ideal sample size should, however, be sufficiently large to allow the researchers to obtain reliable estimates [92].
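As an illustration of what a reported sample size calculation can look like, the Python sketch below implements Cochran's formula for estimating a proportion, with an optional finite population correction. This formula is an assumption chosen for illustration; it is not necessarily the one used by the cited studies, and the function name and default values are hypothetical.

```python
import math


def cochran_sample_size(z=1.96, p=0.5, margin=0.05, population=None):
    """Minimum sample size for estimating a proportion (Cochran's formula).

    z          -- normal quantile for the confidence level (1.96 for about 95%)
    p          -- anticipated proportion (0.5 is the most conservative choice)
    margin     -- acceptable margin of error
    population -- optional finite population size for the correction
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    if population is not None:
        # Finite population correction shrinks the requirement for small populations.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


print(cochran_sample_size())                   # infinite-population case
print(cochran_sample_size(population=10_000))  # with finite population correction
```

Studies that estimate a mean WTP rather than a proportion would need a variance-based formula instead, so the appropriate calculation depends on the study's primary outcome.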

Mode of administration
Face-to-face interviews (40.6%) and web-based surveys (39.1%) were the most frequently applied modes of administration. Different modes of administration might affect the study results as well [8]. Given the complexity of the task, we would recommend face-to-face interviews, if possible, as they enhance understanding and interaction between the interviewers and the respondents; nevertheless, they are also more resource demanding. Digital communication tools such as Skype or Zoom might be considered to reduce travel and other related costs.

Hypothetical scenarios
Regarding the context of the hypothetical scenario, there is no strong evidence that one context is favored over the other (ex post 37.5% vs ex ante 42.2%). In line with previous studies, threshold values from the ex ante context might be higher than those from the ex post context [47,56,59]. Some have argued that ex ante may lead to higher uncertainty than ex post, as the ex post respondents consider other factors, such as income [47]. The ex ante context is generally appropriate for identifying preferences in the case of a life-threatening disease [47]. However, in deciding whether to use an ex ante or ex post context when setting up a hypothetical scenario, one needs to weigh the choice carefully together with other factors.
Regarding the type of hypothetical scenario, most studies were not specific to any disease/illness (62.5%), and the arguments concern its easy implementation. For those studies that applied a specific disease in the hypothetical scenario, the threshold value was positively associated with disease severity; for example, a severe cancer scenario would lead to higher threshold values [6,51,93] than mild ones, such as facial reanimation [34,36]. Further investigation is needed to determine whether multiple threshold values should be applied within a country, i.e., according to the disease severity or specific population, such as children. The disadvantage of a single threshold value is that for patients with severe disease such as cancer or acute or fatal diseases, it is less likely that the relevant treatment will be reimbursed, as the relevant treatment costs are high. Therefore, it might be reasonable to consider having multiple thresholds within a country. However, this must be balanced against the local health budget setting.

Type of QALY gain
The type of QALY gain largely impacts the threshold value, with the life-saving scenario giving the highest value, followed by the life extension scenario and the quality of life improvement scenario [44,59,94].The above findings may support the establishment of different thresholds for different health scenarios.In some countries, such as England and the Netherlands, separate higher thresholds for end-of-life treatments are applied [95][96][97].We recommend investigating different life scenarios when eliciting thresholds in a country.
For informed QALY gain, the respondents are informed about the magnitude or size of the QALY gain; for uninformed QALY gain, the respondents are not informed about the size of the QALY gain. The former is applied more than the latter (64% versus 31%). WTP varies inversely with the magnitude of QALY gain [3,16,17,59], and higher WTP was associated with smaller QALY gain [22,42,45,98].

Payment vehicle
Regarding payment, lump sum payment and paying in installments were the most frequently applied methods; however, nearly one-fifth of the studies did not report which method was applied. Different payment vehicles may also impact the threshold values: paying in installments might be associated with a higher threshold value relative to paying a lump sum, as the former allows respondents to pay more than once and thus avoid facing a ceiling effect later [29,30,37,39]. The choice of payment vehicle, however, needs to fit the context in the country, i.e., remain in line with the payment/reimbursement system for health care.

Health preference-eliciting methods
For methods estimating the health preference score, the direct methods have gained popularity over the PBM (43.8% vs. 34.4%). Among the direct methods, it is most popular to apply a rating scale, either alone or mixed with SG or TTO. However, it is arguable whether the rating scale is appropriate for eliciting health utility, as it is not a choice-based method [61]. SG or TTO may be considered, as these methods are choice based and are more often recommended by economists compared to VAS [61]. However, it might be challenging to ask those questions, which is why PBMs are often applied to bypass the SG and TTO for estimating health utility [61]. Among the applied PBMs, the EQ-5D instrument was the most popular (29.7%) because it is available in many different language versions, including for LMICs, and local tariffs or neighboring countries' tariffs might be available [99]. To estimate the health preference score, we would recommend that the researcher first check whether any PBM is available in the local language and whether a local tariff or a neighboring country's tariff is also available.

Willingness to pay-eliciting methods (WEM)
The stated preference method was the most common choice for obtaining threshold values. Relative to this method, the revealed preference method requires data from actual behavior to derive values for health gain [62,63], which may not be available systematically in LMICs. Meanwhile, the stated preference method makes it easier to include a wide range of scenarios, thus requiring a smaller sample size and fewer resources for conducting the study, which could be favored by LMICs. However, when using hypothetical scenarios, respondents might face challenges in imagining all the relevant components of all the scenarios, including hypothetical conditions, severity, reached outcomes, risks, or duration of scenarios [59]. Therefore, one must bear in mind that the hypothetical scenario should be carefully constructed and accompanied by proper guidelines so that the respondents can understand their task well and give reliable answers.
Regarding the WTP eliciting method, the contingent valuation method dominates (89.1%), although it contains a wide range of different approaches, such as open-ended questions, bidding games, and card sorting. Many researchers (43.8%) mix at least two methods, usually an open-ended question with some other contingent valuation methods, to obtain a more reliable estimation. DCE has become more popular recently, as it might be easy for the respondent to understand [3], decreasing the cognitive burden and the complexity of the survey, as well as the measurement error [100]. However, the design of the DCE task is rather complex, and it is challenging to evaluate whether the design has reached sufficient efficiency [101]. When DCE is used to elicit stated preference, the choices are only defined by the WTP measure without involving health preference, and this method does not account for individual preference heterogeneity [102]. For LMICs, we would therefore encourage the use of contingent valuation methods to elicit WTP in practice.

WTP/QALY combination method
There are two methods for combining WTP and QALY: the aggregated and disaggregated approaches, where the latter tends to generate higher threshold values than the former [59,103]. The advantage of the disaggregated approach is that each individual's WTP for a QALY gain enters directly into the calculation of the mean value, but the analysis excludes non-traders (whose WTP is 0) and respondents expressing a QALY gain of zero. The advantage of the aggregated method is its simplicity and inclusion of all respondents. However, this method does not consider the heterogeneity in preferences across individuals. Some authors support the aggregated method because of its internal consistency properties (the problem of zeros), while others prefer to account for individual WTP per QALY ratios [103]. We recommend that the choice of analysis be considered carefully, as it must suit the characteristics of the data collected [104].
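The difference between the two combination approaches can be illustrated with a minimal sketch. It assumes that the aggregated approach is computed as a ratio of means over all respondents and the disaggregated approach as a mean of individual ratios after excluding zero WTP and zero QALY gain; the function names and the four responses are hypothetical and are not taken from any reviewed study.

```python
from statistics import mean


def aggregated_wtp_per_qaly(wtp, qaly_gain):
    """Ratio of means: mean WTP divided by mean QALY gain, using all respondents."""
    return mean(wtp) / mean(qaly_gain)


def disaggregated_wtp_per_qaly(wtp, qaly_gain):
    """Mean of individual WTP/QALY ratios.

    Respondents with zero WTP (non-traders) or zero QALY gain are dropped,
    because their individual ratio is zero or undefined.
    """
    ratios = [w / q for w, q in zip(wtp, qaly_gain) if w > 0 and q > 0]
    return mean(ratios)


# Hypothetical responses (illustrative numbers only).
wtp = [0.0, 100.0, 200.0, 300.0]       # stated WTP per respondent
qaly_gain = [0.10, 0.05, 0.10, 0.25]   # QALY gain valued by each respondent

aggregated = aggregated_wtp_per_qaly(wtp, qaly_gain)
disaggregated = disaggregated_wtp_per_qaly(wtp, qaly_gain)
print(f"aggregated:    {aggregated:.1f}")
print(f"disaggregated: {disaggregated:.1f}")
```

In this illustrative sample the disaggregated value exceeds the aggregated one, partly because the non-trader is excluded from the former but lowers the mean WTP in the latter. Real data need not behave the same way.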

Relation between WTP/QALY and GDP per capita
We also found that values for WTP/QALY were below 1 x GDP per capita in most countries, regardless of the country's income level. This might suggest that the WHO recommendation of applying 1-3 x GDP per capita is inappropriate: an overly high threshold could lead to treatments being reimbursed that the budget cannot sustain, resulting in a deficit. A specific higher threshold could, for example, be considered for cases of severe disease or terminal illness; however, this should be introduced with clear standards/criteria and justifications to avoid funding detriment [59,95].
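As a minimal sketch of the comparison described above, an elicited WTP/QALY value can be expressed as a multiple of GDP per capita and set against the 1-3 x GDP per capita range. The function names and the input values are hypothetical; both inputs are assumed to be in the same currency and year.

```python
def wtp_to_gdp_ratio(wtp_per_qaly, gdp_per_capita):
    """Express a WTP/QALY value as a multiple of GDP per capita."""
    return wtp_per_qaly / gdp_per_capita


def classify_against_gdp_range(ratio):
    """Place the ratio relative to the 1-3 x GDP per capita range."""
    if ratio < 1:
        return "below 1 x GDP per capita"
    if ratio <= 3:
        return "within 1-3 x GDP per capita"
    return "above 3 x GDP per capita"


# Hypothetical country: WTP/QALY of 16,000 and GDP per capita of 32,000,
# both in the same currency and year (illustrative numbers only).
ratio = wtp_to_gdp_ratio(16_000.0, 32_000.0)
label = classify_against_gdp_range(ratio)
print(ratio, label)
```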

Strengths and limitations
This systematic review provided a comprehensive and in-depth investigation of existing studies eliciting WTP per QALY using the direct approach and compared the existing threshold values with the WHO-recommended value. Our work provided deep insights into the different methods applied to eliciting WTP/QALY, as well as key points to consider when conducting such studies, especially in the context of LMICs. To the best of our knowledge, our study was the first to provide a comprehensive synthesis of the methods, relevant characteristics and results of studies that elicited WTP per QALY. The application of BMA accounts for the uncertainty in variable selection by averaging over the best models, in contrast with traditional model-building strategies such as stepwise methods, which may result in biased estimates and overly narrow confidence intervals [74].
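The model-averaging idea can be sketched as follows. This is not the implementation used in this review: it assumes ordinary least squares, a BIC approximation to each model's marginal likelihood, equal prior model probabilities, and exhaustive enumeration of predictor subsets. The function names and the simulated data are hypothetical.

```python
import itertools
import math
import random


def ols_rss(X, y):
    """Residual sum of squares of a least-squares fit (X already holds the intercept column)."""
    n, k = len(X), len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    M = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gaussian elimination with partial pivoting (assumes full column rank).
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            f = M[r][col] / M[col][col]
            for j in range(col, k + 1):
                M[r][j] -= f * M[col][j]
    # Back substitution for the coefficients.
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (M[i][k] - sum(M[i][j] * b[j] for j in range(i + 1, k))) / M[i][i]
    return sum((y[r] - sum(X[r][j] * b[j] for j in range(k))) ** 2 for r in range(n))


def inclusion_probabilities(predictors, y):
    """Posterior inclusion probability per predictor from BIC-based model weights."""
    names = list(predictors)
    n = len(y)
    models = []
    for size in range(len(names) + 1):
        for subset in itertools.combinations(names, size):
            X = [[1.0] + [predictors[name][r] for name in subset] for r in range(n)]
            bic = n * math.log(ols_rss(X, y) / n) + (len(subset) + 1) * math.log(n)
            models.append((subset, bic))
    best = min(bic for _, bic in models)
    # Weight of each model is proportional to exp(-BIC / 2) under equal model priors.
    weights = [math.exp(-0.5 * (bic - best)) for _, bic in models]
    total = sum(weights)
    return {name: sum(w for (subset, _), w in zip(models, weights) if name in subset) / total
            for name in names}


# Simulated data: the outcome depends on "signal" only, "noise" is irrelevant.
rng = random.Random(0)
n_obs = 200
x_signal = [rng.gauss(0, 1) for _ in range(n_obs)]
x_noise = [rng.gauss(0, 1) for _ in range(n_obs)]
outcome = [2.0 * s + rng.gauss(0, 1) for s in x_signal]

pip = inclusion_probabilities({"signal": x_signal, "noise": x_noise}, outcome)
print(pip)
```

A predictor that appears in nearly all well-supported models receives an inclusion probability close to 1, which is how a posterior probability of 100% for a factor can be read. Enumerating every subset is only practical for a small number of candidate factors.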
There are, however, a few limitations that need to be addressed. First, some studies did not report when the research was conducted, and we used the publication year instead. Second, sample size was evaluated based on the total sample size of the study, not the sample size underlying each WTP per QALY value. Furthermore, as our recommendations were mostly based on studies from high- and upper-middle-income countries (97%), caution is needed when performing WTP/QALY studies in lower-middle- and low-income countries, and further investigations are needed to better understand WTP/QALY in that context.

Recommendations for LMICs
The utilization of the demand-side direct approach (WTP/QALY) may offer a more practical means of establishing a national threshold value within LMICs, primarily due to resource constraints and data limitations. However, this approach should be employed with thoughtful assessment of its methodological precision, applicability, and ethical consequences. Collaborative endeavors involving policymakers, researchers, and stakeholders are encouraged to establish strong and acceptable cost-effectiveness thresholds using the WTP/QALY direct approach in LMICs. To better understand the methodological barriers associated with eliciting WTP per QALY in LMICs, especially in low-income countries, more studies are needed in those countries. Qualitative studies focusing on how respondents answer the relevant questions and on stakeholders' views about the threshold value in those countries hold particular significance and warrant further investigation. Based on findings from this review, we recommend that:
• A societal perspective might be more theoretically convincing for estimating the threshold value.
• The general population should be sampled for eliciting a national threshold value.
• A sufficiently large sample size should be used so that researchers can estimate reliable results.
• Face-to-face interviews are recommended for mode of administration.
• The hypothetical scenario should not be limited to any specific disease; whether to use an ex ante or ex post scenario should be evaluated carefully, together with other factors.
• Different life scenarios (life-saving scenario, life extension scenario, quality of life improvement scenario) should be investigated.
• The choice of payment vehicle should depend on the context in the country (i.e., which is the most in line with the payment/reimbursement system for health in that country).
• PBMs are recommended for eliciting health preferences, provided that a PBM is available in the local language and a local tariff or a neighbouring country's tariff is also available. If a PBM is not available, either SG or TTO can be considered.
• The combination of at least two contingent valuation methods is recommended, usually an open-ended question with some other contingent valuation method, to obtain more reliable estimates.
• The choice of WTP/QALY combination method should be considered carefully and should suit the characteristics of the data collected.
• Collaborative efforts involving policymakers, researchers, and stakeholders are vital to establish robust and widely accepted thresholds.

Conclusions
Methods for deriving WTP/QALY vary largely across studies. Eleven influential factors were identified that contribute most to the level of the WTP/QALY value, of which the discrete-choice experiment method had the greatest effect. In most countries, values for WTP/QALY were below 1 x GDP per capita; therefore, where no such research has been done, the threshold suggested for LMICs lies below 1 x GDP per capita. Some important principles are addressed regarding what LMICs may need to consider when conducting studies to estimate WTP/QALY.
The detailed interpretation of Fig 1 is presented in S1 Text.

Fig 4
Fig 4 shows a boxplot of willingness to pay per QALY by country, converted into 2021 international dollars. Sweden, the Netherlands, Finland and Israel had large variability in WTP/QALY results. However, the median WTP/QALY of all countries was generally below i$150,000.
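The conversion into 2021 international dollars via CPI and PPP indexes can be sketched as below. The order of operations is an assumption (inflate the local-currency value to the target year with CPI, then divide by the target-year PPP conversion factor); the review does not spell out its exact procedure here, and the function name and input values are hypothetical.

```python
def to_international_dollars(value_local, cpi_study_year, cpi_target_year, ppp_target_year):
    """Convert a local-currency value to target-year international dollars.

    Assumed steps: inflate to the target year with the CPI ratio, then divide
    by the PPP conversion factor (local currency units per international dollar).
    """
    inflated = value_local * cpi_target_year / cpi_study_year
    return inflated / ppp_target_year


# Hypothetical inputs (illustrative numbers only, not real index values).
value_i_dollars = to_international_dollars(
    value_local=100_000.0,   # WTP/QALY reported in local currency
    cpi_study_year=80.0,     # CPI in the year of the study
    cpi_target_year=100.0,   # CPI in the target year
    ppp_target_year=5.0,     # local currency units per international dollar
)
print(value_i_dollars)
```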

Mix of the direct elicited health preference method and PBM
dollars in 2021 (i$) [75] was calculated, presented in S3 Table. In general, the median WTP per QALY of countries varied significantly from i$2,643.6 to i$145,833.8, with the lowest value in Greece and the largest in Bulgaria. The study in Bulgaria interviewed doctors, with metastatic cancer as a hypothetical scenario; hence, this resulted in a high WTP/QALY value and a high ratio of WTP per QALY per GDP per capita [6]. However, the median WTP per QALY for all countries was i$16,647.6, while the median WTP per QALY per GDP per capita of